wo 2«()5/«l»6621 PCT/EP2(K)4/()(]68US 

System and method for determining clock skew in a packet-based telephony session

The present invention relates to a system and method for determining clock skew in a packet- 
based telephony session. 

Traditional telephony via the PSTN (Public Switched Telephone Network) reserves
bandwidth in advance of a call and dedicates that bandwidth for the duration of the call.
Additionally, it preserves the timing relationships in speech between sender and receiver
through use of a common precise clock. This means that the speech is encoded at the sender
exchange (with a 125 microsecond sample period), transmitted across the network and
decoded at the receiver exchange, with both encoding and decoding processes essentially
synchronised because they share a common clock.

Packet-based telephony, in particular Voice over IP (VoIP), employing local area networks
(LANs), wide area networks (WANs) or the Internet, on the other hand splits data into
packets and transmits them independently of one another. However, transmitting multimedia
data over packet-based networks introduces problems if the temporal relationship between
adjacent packets at the sender cannot be maintained and reconstructed at the receiver. The
trend towards Voice over IP (VoIP) in recent years has raised a range of complexities, in
particular, resulting from the lack of a common clock.

These problems are described with reference to Figure 1, where two Internet telephony
devices 10-A and 10-B comprising, for example, a standard PC or IP phone run respective
telephony applications 14. These can be voice-only applications or can be voice and video
applications. (For video applications, the device will also include a video card (not shown).)
During a session, each application 14 sends and receives packets of multi-media information
across the Internet 12 and temporarily stores the received packets of information in an
associated application buffer 16.

In the case of voice information, a codec 18 takes received packets from the buffer 16 and
decodes the packet information to provide more binary-like information for storing in a
receive portion of buffer 26 in an audio card 20 located in or associated with the telephony
device. The audio card 20 then replays the received information through, for example,
speaker(s) 30 or headphones connected to the audio card 20.

Sound received from a microphone or headset 32 is recorded by the audio card and is stored
in a transmit portion of the buffer 26. This is encoded by the codec 18 and transmitted to the
receiver.

The receive portions of one or both of the buffers 16 and 26 are employed to counter the
effects of the potentially highly variable delay rate for packets, known as jitter, caused by the
Internet's best-effort service. These buffers absorb jitter by accumulating incoming packets,
helping to ensure that playout is periodic and thus of good quality.

Each telephony device 10 typically contains a number of relatively low-grade oscillator
crystals, among them the system clock crystal 24 to maintain system time, and an audio clock
crystal 22, to set the sample periods for recording prior to encoding and for playback of
decoded information. Such oscillator crystals can have inherent frequency errors greater than
a few hundred parts-per-million, resulting in accumulated errors of tens of seconds per day.
For the purposes of the present application, the term "clock skew" is defined as this
difference in a crystal's actual oscillator frequency from its nominal frequency.
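
By way of illustration only, the relationship between a frequency error expressed in
parts-per-million and the resulting accumulated error per day can be sketched as follows.
The function name and the sample ppm values are assumptions chosen for this example and are
not taken from any particular device.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400 seconds


def drift_seconds_per_day(ppm: float) -> float:
    """Accumulated time error per day for a clock whose frequency
    deviates from its nominal value by `ppm` parts per million."""
    return ppm * 1e-6 * SECONDS_PER_DAY


if __name__ == "__main__":
    # Illustrative values only: a few hundred ppm gives tens of seconds per day.
    for ppm in (100, 300, 500):
        print(f"{ppm} ppm -> {drift_seconds_per_day(ppm):.2f} s/day")
```

On these assumed figures, an error of 300 parts-per-million corresponds to roughly 26 seconds
per day, consistent with the order of magnitude stated above.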

Although the rate at which voice is recorded for encoding by the sender and played out after
decoding by the receiver is purely determined by the audio card clock, the system clock is
also used if, for example, packet-delay measurements are required, which is often the case. As
such, there are often four separate clocks contributing to the session, each with its unique
skew as illustrated in Figure 2.

The NTP protocol (Network Time Protocol) employs numerous primary and secondary
servers available through the Internet that are synchronized to Coordinated Universal Time
(UTC) via radio, satellite or modem. This protocol enables the synchronisation of system
clocks 24 across the Internet. Alternatively, as disclosed in US Patent No. 6,360,271, GPS
clocks can be used to synchronise system clocks. The effect of synchronizing the system
clocks 24 is to eliminate the effects of the deviation of the respective system clocks from their
nominal frequency, i.e. system clock skew.

Still, a number of skew-related problems can arise:

Firstly, and with reference to packets being transmitted from device 10-A to 10-B, if the
sender audio clock 22-A operates faster than receiver audio clock 22-B, this will lead to
packet accumulation in one or other of the receive portions of the buffers 16-B, 26-B. This
results in higher buffer residency delays and possibly buffer overflow (packet loss). If the
sender audio clock 22-A operates slower than clock 22-B, it will result in underfill of one
or both of buffers 16-B, 26-B. Of course, the same applies for audio clock 22-B and the
buffers 16-A, 26-A. Thus, if the receiver audio clock rate differs from the sender audio clock
rate, then the receiver buffer(s) will either gradually fill or empty.
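
The gradual filling or emptying of the receiver buffer can be sketched, by way of example
only, as below. The sketch assumes a nominal 8000 Hz sample rate (the 125 microsecond sample
period mentioned above), expresses each audio clock's error in parts-per-million, and uses
hypothetical function names and a hypothetical buffer headroom; none of these are prescribed
by the present description.

```python
import math

NOMINAL_RATE_HZ = 8000.0  # assumed: 125 microsecond sample period


def buffer_drift_samples_per_second(sender_ppm: float, receiver_ppm: float) -> float:
    """Net growth of the receive buffer in samples per second.

    Positive: the sender produces samples faster than the receiver plays them
    (buffer fills). Negative: the buffer drains (underfill).
    """
    sender_rate = NOMINAL_RATE_HZ * (1 + sender_ppm * 1e-6)
    receiver_rate = NOMINAL_RATE_HZ * (1 + receiver_ppm * 1e-6)
    return sender_rate - receiver_rate


def seconds_until_limit(headroom_samples: float, sender_ppm: float,
                        receiver_ppm: float) -> float:
    """Time until `headroom_samples` of spare buffer space (or of buffered
    audio, for underfill) is used up by the drift."""
    drift = buffer_drift_samples_per_second(sender_ppm, receiver_ppm)
    if drift == 0:
        return math.inf  # matched clocks: the buffer level never drifts
    return headroom_samples / abs(drift)


if __name__ == "__main__":
    # Hypothetical figures: sender +200 ppm, receiver -100 ppm, 200 ms headroom.
    print(buffer_drift_samples_per_second(200, -100))
    print(seconds_until_limit(1600, 200, -100))
```

With these assumed figures the buffer gains about 2.4 samples per second, so a 200 ms
(1600-sample) headroom would be consumed in roughly eleven minutes.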

Secondly, in order to absorb the effects of network jitter, many VoIP applications utilise
adaptive buffering approaches. These applications need to estimate changes in one-way
delays and react accordingly. Other approaches use synchronised time for precise per-packet
delay measurement, see for example H. Melvin and L. Murphy, "An evaluation of the use of
synchronised time within a hybrid fixed-adaptive playout VoIP application", Proceedings of
IEEE Int'l Conference on Communications 2003, Anchorage, Alaska, May 2003 (Melvin et
al). However, as outlined above, the rate at which packets are sent by the sender is solely
determined by the sender audio card clock 22 (and not the sender system clock 24).

Again, with reference to packets being transmitted from device 10-A to 10-B, if the sender
audio clock rate 22-A (which determines the rate at which packets are sent) is different from
the receiver system clock 24-B (which timestamps packet arrivals to estimate delays), this
will manifest itself in an apparent gradual increase or decrease in one-way delay. Thus skew
between the sender audio card 22-A and receiver system clock 24-B will distort such
measurements and thus the play-out mechanism and ultimately sound quality.
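
The apparent drift in one-way delay can be illustrated with the following simplified sketch.
It assumes a constant true network delay, a 20 ms packet interval, and a receiver system
clock that keeps true time; the function name and all numeric values are hypothetical and
serve only to show the direction and approximate size of the effect.

```python
def apparent_delays(skew_ppm: float, true_delay_s: float = 0.050,
                    packet_s: float = 0.020, n_packets: int = 3000) -> list:
    """Per-packet delay as estimated by a receiver that compares arrival time
    (receiver system clock, assumed exact) with the media time implied by the
    packet's timestamp (driven by the sender audio clock).

    `skew_ppm` > 0 means the sender audio clock runs fast.
    """
    rate = 1 + skew_ppm * 1e-6
    delays = []
    for n in range(n_packets):
        media_time = n * packet_s        # time implied by the media timestamp
        true_send = media_time / rate    # a fast audio clock sends packets early
        arrival = true_send + true_delay_s
        delays.append(arrival - media_time)
    return delays


if __name__ == "__main__":
    fast = apparent_delays(+100)  # sender audio clock 100 ppm fast
    slow = apparent_delays(-100)  # sender audio clock 100 ppm slow
    print(f"fast sender: {fast[0] * 1e3:.3f} ms -> {fast[-1] * 1e3:.3f} ms")
    print(f"slow sender: {slow[0] * 1e3:.3f} ms -> {slow[-1] * 1e3:.3f} ms")
```

Under these assumptions, a skew of 100 parts-per-million shifts the estimated delay by about
6 ms per minute of speech even though the true network delay never changes.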

A number of approaches to resolving audio card clock skew between sender and receiver in a
VoIP session have been proposed. O. Hodson, C. Perkins, and V. Hardman, "Skew Detection
and Compensation for Internet Audio Applications", Proceedings of the IEEE Int'l
Conference on Multimedia and Expo, NY, July 2000; and R. Akester and S. Hailes, "A New
Audio Skew Detection and Correction Algorithm", Proceedings of the IEEE Int'l Conference
on Multimedia and Expo, Lausanne, Aug. 2002 both disclose utilising a low level mechanism
that measures audio skew by monitoring the data flow through the receiver device, i.e. audio
card buffers 26-A, 26-B, and thus involve low level programming and manipulation of audio
card drivers.

Because these approaches require low-level knowledge and manipulation of audio card
hardware/software, implementation details will be product-specific, although the concepts
are universally applicable. Additionally, the mechanism used to measure audio skew is
subject to 'noise' from network jitter and thus can return wrong results and respond
inappropriately unless such noise is filtered out. Such filtering is a non-trivial problem.

According to the present invention there is provided a method according to claim 1.

The present invention can be implemented at a higher level than disclosed in the prior art and
can utilise existing Internet protocols. In the preferred embodiment, audio skew is measured
through a combination of RTP (Real-time Transport Protocol) Control Protocol (RTCP)
Sender Report (SR) packets and use of NTP (Network Time Protocol) and is thus unaffected
by network jitter. As such the mechanism will operate regardless of the underlying
hardware/software.

Additionally, the preferred embodiment facilitates the effective implementation of
synchronised time, by determining the skew between a sender audio clock and a receiver
system clock which would otherwise degrade the benefits of synchronised time, and this can
in turn lead to more effective playout strategies.

Embodiments of the invention will now be described, by way of example, with reference to
the accompanying drawings, in which:

Figure 1 is a schematic diagram illustrating the components involved in a packet-based
telephony session;

Figure 2 illustrates the effect on sampling of clock skew for the audio and system clocks of
Figure 1; and

Figure 3 illustrates the information included in RTP and RTCP protocol packets for
transmitting information between the devices of Figure 1.

The preferred embodiment of the present invention is implemented in packet-based
telephony applications of the type shown in Figure 1. The preferred embodiment uses
existing Internet protocols already employed by the applications 14 to mitigate the effects
outlined above of clock skew.

Referring now to Figure 3, this shows the header information for various packets
transmitted by the multi-media telephony applications 14. RTP is an example of an Internet
protocol used by such applications to deliver multimedia data. See H. Schulzrinne, S. Casner,
R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications,"
Internet Engineering Task Force RFC 1889, Jan. 1996 for further information on RTP and the
companion protocol RTCP.

For the purposes of the present application, each RTP packet includes an RTP header which
in turn includes a sequence number (SQ) which is incremented for each RTP packet sent and
a timestamp (TS) indicating the sampling instant of the first octet in the RTP data packet.
These enable a receiver to accurately reconstruct media packets for playout. The timestamps
are media specific and, in the case of voice data packets, the timestamps TSA include the
sample number generated by the codec incremented at a rate determined by the audio card
clock.

Thus, in Figure 3, the device 10-A transmits a sequence of audio packets in RTP format.
Audio packet RTPA#n will have a sequence number (SQA#n) corresponding to n, and the
time-stamp of the audio clock (TSA#x) at the instant the packet was created. The audio packet
RTPA#n+m will have an audio clock time stamp a given number of audio clock samples y
after the time stamp for audio packet RTPA#n.

In a multi-media telephony application (e.g. videoconferencing with audio/video), at the same
time, the codec 18 encodes RTP packets for information received from the video card. The
sequence of video packets and their respective time-stamps are independent of those for the
audio packets as they are based on video card clock samples.

As mentioned above, RTCP is a companion control protocol for RTP. RTCP SR packets are
generated periodically for each media stream received by devices that are also senders. Thus,
in multi-media telephony applications, during the lifetime of a media session, each sender
periodically generates both audio (A) and video (V) RTCP SR packets and sends them to
each receiving device. For the purposes of the present application, RTCP SR packets can be
thought of as including two timestamps that are used especially in multimedia telephony to
enable a receiver to synchronize audio and video packets and provide lip-synch. The
timestamps are the system clock timestamp (in NTP format) indicating when the SR packet
was generated, along with the corresponding RTP timestamp which is in the same format as
the time-stamps TS in the RTP packets and thus determined by the audio or video card clock.
This enables a receiver to match received audio packets with received video packets
produced at the same time by a sender.

The preferred embodiment employs RTCP SR audio packets even when there is no video
stream with which to synchronise the audio packets. The preferred embodiment is based on
the realisation that if both system and audio card clocks are running at the same deviation
from nominal on a given device, the time increment derived from respective RTP and NTP
timestamps in successive RTCP SR audio packets will be equal. For example, if the interval
between RTCP SR packets is 10 seconds according to the NTP timestamps, and if the audio
card clock sample interval is 125 microseconds, the RTP timestamp increment should be
80000.
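
The arithmetic of this example can be written out as a short sketch. The function name is
hypothetical, and the 8000 Hz default simply restates the 125 microsecond sample interval
as a rate.

```python
def expected_rtp_increment(ntp_interval_s: float,
                           nominal_rate_hz: int = 8000) -> float:
    """RTP timestamp increment expected between two RTCP SR packets whose NTP
    timestamps are `ntp_interval_s` seconds apart, if the audio card clock and
    the system clock deviate equally from nominal."""
    return ntp_interval_s * nominal_rate_hz


if __name__ == "__main__":
    # 10 seconds at one sample every 125 microseconds (8000 samples per second).
    print(expected_rtp_increment(10))
```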

However, any difference in the interval defined by the successive RTP and NTP time-stamps
indicates to the sender (and receiver) of the RTCP SR packets the skew between audio card
and system clock rates within the sending machine. So, for example, if the audio card clock
rate of the device 10-A is running faster than system clock 24-A, the time-stamp numbers for
the RTPA components of RTCP SR packets sent 10 seconds apart (according to its system
clock 24-A) will advance by more than 80000. Referring to Figure 2, this enables the device 10-A
to determine the relative relationship between the lines 22-A and 24-A (corresponding to the
clocks 22-A and 24-A); and the device 10-B to determine the relative relationship between
the lines/clocks 22-B and 24-B.
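
A minimal sketch of this comparison is given below, by way of example only. It assumes that
each SR packet has already been parsed into a 64-bit NTP timestamp (32-bit seconds and
32-bit fraction) and a 32-bit RTP timestamp, as those fields are laid out in the RTP
specification; the `SenderReport` class and the function names are hypothetical and do not
correspond to any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenderReport:
    """Timestamp fields of one RTCP SR packet (hypothetical container)."""
    ntp_seconds: int    # upper 32 bits of the NTP timestamp
    ntp_fraction: int   # lower 32 bits, in units of 1/2**32 second
    rtp_timestamp: int  # 32-bit media timestamp driven by the audio card clock


def ntp_elapsed_seconds(first: SenderReport, second: SenderReport) -> float:
    """Interval between two reports according to the system clock."""
    t1 = (first.ntp_seconds << 32) | first.ntp_fraction
    t2 = (second.ntp_seconds << 32) | second.ntp_fraction
    return (t2 - t1) / 2**32  # subtract as integers to keep precision


def audio_vs_system_skew_ppm(first: SenderReport, second: SenderReport,
                             nominal_rate_hz: int = 8000) -> float:
    """Skew of the audio card clock relative to the system clock of the device
    that generated the reports, in parts per million (positive: audio fast)."""
    elapsed = ntp_elapsed_seconds(first, second)
    if elapsed <= 0:
        raise ValueError("reports must be in increasing NTP time order")
    # Modulo arithmetic tolerates a single wrap of the 32-bit RTP timestamp.
    rtp_elapsed = (second.rtp_timestamp - first.rtp_timestamp) % 2**32
    expected = elapsed * nominal_rate_hz
    return (rtp_elapsed / expected - 1) * 1e6


if __name__ == "__main__":
    a = SenderReport(3_900_000_000, 0, 0)
    b = SenderReport(3_900_000_010, 0, 80_008)  # 8 samples beyond 80000 in 10 s
    print(f"{audio_vs_system_skew_ppm(a, b):.1f} ppm")
```

In this hypothetical case the RTP timestamp advances by 80008 rather than 80000 over ten
seconds, indicating an audio card clock running about 100 parts-per-million fast relative
to the system clock. Because both timestamps are taken at the sender, network jitter does
not enter the calculation.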



At the same time, each receiver can accumulate timestamp information contained within
successive RTCP SR packets from the sender. This is conventionally used to enable the
sender to calculate the round trip time, and also provides feedback to the sender relating to
the quality of the session as seen by the receiver. However, in the preferred embodiments,
any deviation of the audio card clock sampling rate from the system clock rate indicated by
the NTP time stamps enables each receiver to precisely and quickly determine the skew
value between a sender's system and audio card clocks. Referring to Figure 2, this enables the
device 10-A to determine the relative relationship between the lines/clocks 22-B and 24-B
and the device 10-B to determine the relative relationship between the lines/clocks 22-A and
24-A.



In the preferred embodiment, system clocks are synchronised, for example, via the Internet
protocol NTP or any other suitable mechanism. Melvin et al show that NTP will provide
millisecond-level synchronisation on Local Area Networks and well-provisioned Wide Area
Networks. If not explicitly synchronised, then the implementation is based on the assumption
that the clocks 24-A and 24-B of Figure 1 are relatively synchronous and that the
implementation is used to mitigate the effects of audio card clock skew, where the degree of
audio card clock skew is assumed to be worse than system clock skew.

In any case, knowing or assuming that the system clocks are synchronised, and knowing the
relationship between the lines 22-A, 24-A (or for 10-A the relationship between lines 22-B,
24-B), each receiver can determine the skew between a sender audio clock and the receiver
system clock, i.e. for 10-B the relationship between the clocks/lines 22-A and 24-B; and for
10-A, that between 22-B and 24-A respectively.

This combination of RTCP and NTP enables each receiver to determine precisely what
compensating factor needs to be applied to incoming packets to avoid the gradual distortion
of one-way delay that otherwise will corrupt the performance of adaptive playout algorithms
and playout strategies based on synchronised time.

Furthermore, by examining its own RTCP SR packets being generated for transmission, the
receiver can determine the skew between its own audio and system clocks. From an analysis
of successive RTCP packets (incoming and generated), each receiver can therefore generate a
precise picture of all four clock rates and implement appropriate compensatory action.
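
One way in which a receiver might combine the two measurements is sketched below, purely as
an illustration. It assumes that the two system clocks are synchronised (so that they can be
treated as a single reference), that each audio-versus-system skew has already been obtained
in parts-per-million from successive SR packets, and that the function names are
hypothetical.

```python
def sender_audio_vs_receiver_system_ppm(sender_audio_vs_system_ppm: float) -> float:
    """With synchronised system clocks, the sender audio clock's skew against
    the receiver system clock equals its skew against the sender system clock."""
    return sender_audio_vs_system_ppm


def audio_to_audio_skew_ppm(sender_audio_vs_system_ppm: float,
                            receiver_audio_vs_system_ppm: float) -> float:
    """Skew of the sender audio clock relative to the receiver audio clock
    (positive: the sender produces samples faster than the receiver plays them)."""
    sender_ratio = 1 + sender_audio_vs_system_ppm * 1e-6
    receiver_ratio = 1 + receiver_audio_vs_system_ppm * 1e-6
    return (sender_ratio / receiver_ratio - 1) * 1e6


if __name__ == "__main__":
    # Hypothetical measurements: sender audio +150 ppm (from incoming SR packets),
    # receiver audio -50 ppm (from the receiver's own generated SR packets).
    print(f"{audio_to_audio_skew_ppm(150, -50):.2f} ppm")
```

For small skews the result is close to the simple difference of the two measurements, here
approximately 200 parts-per-million.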

Thus, the preferred embodiment solves two problems: it detects audio-audio clock skew
which can cause buffer under/overfill and also detects delay measurement skew, enabling
playout quality to be optimised, for example, by implementing the hybrid playout algorithm
as described by Melvin et al.

It will be seen that for audio-audio skew, once the skew value is determined, some
mechanism is required to compensate for such skew. Hodson et al outline a solution that
inserts/deletes appropriate samples within the receive portion of the audio card buffer 26 to
compensate for such skew whereas Akester et al attempt to match the receiver audio clock
rate to that of the sender. Alternatively, the application 14 could delete or pad entire packets
within the receive portion of the buffer 16, again ensuring that the invention can be
completely implemented at an application level.
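
The application-level alternative of deleting or padding entire packets could, for example,
be scheduled as in the following sketch. It is one possible approach under stated
assumptions, not a prescribed implementation: packets are assumed to carry 160 samples
(20 ms at 8000 Hz), the audio-audio skew is assumed to be known in parts-per-million, and
the function name and the "drop"/"pad" labels are hypothetical.

```python
def plan_packet_adjustments(skew_ppm: int, n_packets: int,
                            samples_per_packet: int = 160) -> list:
    """Return (packet_index, action) pairs for a run of `n_packets` packets.

    `skew_ppm` is the sender audio clock's skew relative to the receiver audio
    clock. A positive skew builds up surplus samples, so a whole packet is
    dropped each time one packet's worth has accumulated; a negative skew
    builds up a deficit, so a padding packet is inserted instead.
    """
    actions = []
    # Track the surplus in millionths of a sample so integer skews stay exact.
    threshold = samples_per_packet * 1_000_000
    surplus = 0
    for index in range(n_packets):
        surplus += samples_per_packet * skew_ppm
        if surplus >= threshold:
            actions.append((index, "drop"))
            surplus -= threshold
        elif surplus <= -threshold:
            actions.append((index, "pad"))
            surplus += threshold
    return actions


if __name__ == "__main__":
    # Hypothetical: sender 200 ppm fast, four minutes of 20 ms packets.
    print(plan_packet_adjustments(200, 12_000))
```

At a skew of 200 parts-per-million, one packet's worth of surplus accumulates every 5000
packets (100 seconds), so such adjustments would be infrequent.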

It will be seen that while the preferred embodiment has been described in terms of specific
Internet protocols, the invention is not so limited and is applicable where a determination can
be made by a device from packets received from another device of the audio card skew of the
other device.

In this regard, it will be seen that while the embodiment has been described in terms of RTCP
control packets carrying the control information required to implement the invention for RTP
media packets, the invention could be implemented where the media packets also contain the
required control information. Thus, media packets may in fact contain control information or
indeed control packets could contain media information.



